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21 Abstract 

22 Middle East Respiratory Syndrome-related Coronavirus (MERS-CoV) spread to humans via the 

23 zoonotic transmission from camels. MERS-CoV belongs to the lineage C of Betacoronaviruses 

24 (betaCoVs), which also includes viruses isolated from bats and hedgehogs. A large portion of the 

25 betaCoV genome consists of two open reading frames (ORFla and ORFlb) that are translated into 

26 polyproteins. These are cleaved by viral proteases to generate 16 non-structural proteins (nspl-16) 

27 which compose the viral replication-transcription complex. We investigated the evolution of ORFla 

28 and ORFlb in lineage C betaCoVs. Results indicated widespread positive selection, acting mostly 

29 on ORF1 a. The proportion of positively selected sites in ORF1 a was much higher than that 

30 previously reported for the surface-exposed spike protein. Selected sites were unevenly distributed, 

31 with nsp3 representing the preferential target. Several pairs of co-evolving sites were also detected, 

32 possibly indicating epistatic interactions; most of these were located in nsp3. Adaptive evolution at 

33 nsp3 is ongoing in MERS-CoV strains and two selected sites (G720 and R911) were detected in the 

34 protease domain. Whereas position 720 is variable in camel-derived viruses, suggesting that the 

35 selective event does not represent a specific adaptation to humans, the R911C substitution was only 

36 observed in human-derived MERS-CoV isolates, including the viral strain responsible for the recent 

37 South Korean outbreak. It will be extremely important to assess whether these changes affect host 

38 range or other viral phenotypes. More generally, data herein indicate that CoV nsp3 represents a 

39 major selection target and nsp3 sequencing should be envisaged in monitoring programs and field 

40 surveys. 
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43 Importance 

44 Both SARS-CoV and MERS-CoV originated in bats and spread to humans via an intermediate host. 

45 This clearly highlights the potential for coronavirus host shifting and the relevance of understanding 

46 the molecular events underling the adaptation to new host species. We investigated the evolution of 

47 ORFla and ORFlb in lineage C betaCoVs and in 87 sequenced MERS-CoV isolates. Results 

48 indicated widespread positive selection, stronger in ORFla than in ORFlb. Several selected sites 

49 were found to be located in functionally relevant protein regions and some of them corresponded to 

50 functional mutations in other coronaviruses. The proportion of selected sites we identified in ORFla 

51 is much higher than that for the surface-exposed spike protein. This observation suggests that 

52 adaptive evolution in ORFla might contribute to host shifts or immune evasion. Data herein also 

53 indicate that genetic diversity at non-structural proteins should be taken into account when antiviral 

54 compounds are developed. 

55 
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56 Introduction 

57 

58 Middle East respiratory syndrome-related coronavirus (MERS-CoV, 

59 http://talk.ictvonline.org/files/proposals/animal ssma viruses/default.aspx ) was identified as the 

60 causative agent of a new viral respiratory disease in Saudi Arabia in June 2012 (1). Since then, more 

61 than 1,500 cases and 571 deaths have been reported worldwide (as of October 12 th , 2015 

62 http://www.who.int/csr/don/12-october-2015-mers-saudi-arabia/en/), although major outbreaks have 

63 been confined to the Middle East and, more recently, to South Korea. The rate of human-to-human 

64 transmission of MERS-CoV is relatively low, suggesting that a zoonotic reservoir serves as a major 

65 source for transmission (2). Recent studies have indicated that MERS-CoV or a closely related virus 

66 originated in bats and possibly spread to humans via transmission from dromedary camels (3). 

67 Like SARS-CoV, which causes Severe Acute Respiratory Syndrome and evolved in bats as well, 

68 MERS-CoV (order Nidovirales, family Coronaviridae, subfamily Coronavirinae) is a positive- 

69 sense RNA (+ssRNA) virus belonging to the C lineage of the Betacoronavirus (betaCoVs) genus 

70 (4). CoVs are exceptional among RNA viruses for having long (~30-kb) genomes, a feature 

71 associated with a specific genome architecture and with the acquisition of an RNA 3'-to-5' 

72 exoribonuclease activity (exoN) (5). About two thirds of the CoV genome consist of two large 

73 overlapping open reading frames (ORFla and ORFlb), that are translated into the polyproteins 

74 ppla and pplab (this latter synthesized via a -1 ribosome frameshift at the 3' end of ORFla). These 

75 polyproteins are subsequently cleaved by viral proteases to generate 16 non structural proteins 

76 (nspl to 16), most of which compose the viral replication-transcription complex (RTC). With the 

77 exception of nspl and 2, whose functions are poorly understood, most nsps have been characterized 

78 in some detail for either MERS-CoV, SARS-CoV or mouse hepatitis virus (MF1V). Thus, nsp3, a 

79 large multidomain and multifunctional protein, was shown to play essential roles in the virus 
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80 replication cycle. In fact, the papain-like protease (PLpro) activity of nsp3 is responsible for the 

81 initial processing of ppla. Also, nsp3 together with nsps4 and 6, recruits intracellular membranes to 

82 anchor the RTC and to form a reticulovesicular network of double-membrane vesicles (DVMs) and 

83 convoluted membranes where viral RNA replication occurs (6). nsp5 encodes a second viral 

84 protease (3C-like protease, 3CLpro) that cleaves both ppla and pplab to the final nsp products. 

85 nsps7-l 1 provide the primer-making activity and regulate the function of the main RNA-dependent 

86 RNA polymerase (RdRp), this latter encoded by nsp 12. Finally, nsp 13-16 comprise RNA-modifying 

87 enzymes, including the exoN activity in nsp 14 (7, 8). 

88 Viral RNA represents the major pathogen-associated molecular pattern (PAMP) recognized by the 

89 host immune system during CoV infection (9). Both MERS-CoV and SARS-CoV elicit limited 

90 interferon (IFN) response in most cell types, indicating that these viruses have evolved efficient 

91 strategies to evade innate immune sensing and/or to block IFN induction (9). Indeed, these viruses 

92 express antagonists of the IFN response including SARS-CoV ORF6, ORF3b, and nucleoprotein, as 

93 well as MERS-CoV structural and accessory proteins M, ORF4a, ORF4b, and ORF5 (10-13). 

94 Additional immune evasion strategies, though, rely on nsps. In fact, the enzymatic activities of 

95 nspl4 and nspl6 endow the viral RNA of a 2'-0-methylated cap structure that mimics cellular 

96 mRNAs and avoids activation of the innate immunity receptors RIG-I and MDA5 (9). In analogy to 

97 the exoN and endoribonucleases expressed by other viruses such as Lassa Fever Virus and 

98 Pestivirus (14, 15), the ribonuclease activities of nsp 14 and nsp 15 are also thought to play a role in 

99 immune escape by digesting RNA PAMPs (9). Moreover, suppression of IFN responses is mediated 

100 by the PLpro in nsp3 through its deubiquitinating and delSGylating activities (16, 17), as well as by 

101 nspl. The latter inhibits IFN-dependent signaling by decreasing the phosphorylation levels of 

102 STAT1 and suppresses host protein synthesis (18, 19). Finally, PLpro was shown to physically 

103 interact with TRAF3, TBK1, IKKe, STING, and IRF3, which represent key cellular components for 
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104 IFN response (20). 

105 Therefore, the information encoded by CoV ORFla and ORFlb is essential for viral replication and 

106 for immune evasion. For these reasons, inhibitors that interfere with viral enzymatic activities (e.g. 

107 proteases, RdRp) are regarded as promising candidates for therapeutic intervention (21). 

108 From an evolutionary standpoint, different observations suggest that nsps may represent targets of 

109 natural selection. First, genes encoding molecules that directly interact with the host immune 

110 system are thought to be preferential targets of natural selection as a consequence of host-pathogen 

111 arms races (22). Second, adaptation to new hosts in other RNA viruses has been associated to 

112 selective changes in polymerase genes (23, 24). Finally, the acquisition of a complex replication 

113 machinery is evolutionary linked to genome expansion in Nidovirales (5). Nonetheless, 

114 evolutionary studies have mainly focused on the analysis of betaCoV spike proteins, as these are 

115 surface exposed and represent major determinants of host range and tissue tropism (25). Flerein we 

116 investigated the evolution of ORFla and ORFlb in MERS-CoV and in lineage C betaCoVs isolated 

117 from bats and hedgehogs. Results indicate widespread positive selection, stronger in ORFla; within 

118 this region, nsp3 represents a preferential selection target and adaptive evolution is ongoing in 

119 MERS-CoV strains circulating in the current outbreak. 

120 Material and Methods 

121 

122 Sequences and alignments 

123 ORFla/ORFlb sequences for 7 lineage C betaCoVs and 87 MERS-CoV strains (available as of July 

124 2015) were retrieved from the NCBI database; a list of accession numbers for the complete 

125 genomes is provided in Table SI. 

126 Alignment errors are common when divergent sequences are analyzed and can affect evolutionary 

127 inference. Thus, we used PRANK (26) to generate multiple sequence alignments and GUIDANCE 
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128 (27) for filtering unreliably aligned codons (i.e. we masked codons with a score <0.90), as 

129 suggested (28). 

130 The alignments were screened for the presence of recombination events using two methods based 

131 on distinct data features: 1) CARD (Genetic Algorithm Recombination Detection) (29) uses 

132 phylogenetic incongruence among segments in the alignment to detect the best-fit number and 

133 location of recombination breakpoints; the statistical significance of putative breakpoints is then 

134 evaluated through Kishino-Hasegawa (HK) tests; 2) GENECONV (30) tests for significant 

135 clustering of substitutions along sequences; statistical significance is assessed through permutation 

136 with multiple-comparison correction. For both methods recombination breakpoints were considered 

137 significant if the p value was < 0.05. No breakpoint was detected in any analysis. 

138 

139 Detection of positive selection 

140 

141 Gene trees were generated by maximum-likelihood using the program phyML with a GTR plus 

142 gamma-distributed rates model and 4 substitution rate categories(31). 

143 Positive selection can be defined when the nonsynonymous/synonymous rate ratio (co) is higher 

144 than 1; to analyze the presence of episodic positive selection in lineage C betaCoVs viruses we 

145 applied the branch-site test (32) from the PAML suite (33). The test is based on the comparison 

146 between two nested models: a model (MA) that allows positive selection on one or more lineages 

147 (called foreground lineages) and a model (MAI) that does not allow such positive selection. Twice 

148 the difference of likelihood for the two models (Ain) is then compared to a y 2 distribution with one 

149 degree of freedom (32). A false discovery rate correction was applied to take into account a multiple 

150 hypothesis issue generated by analyzing different branches on the same phylogeny (34). When the 

151 likelihood ratio test suggested the action of positive selection, the Bayes Empirical Bayes (BEB) 
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analysis was used to evaluate the posterior probability that each codon belongs to the site class of 


positive selection on the foreground branch. 

BUSTED (branch-site unrestricted statistical test for episodic diversification) (35) is a recently 
developed software designed to describe episodic positive selection that is acting on specific 
branches in the phylogeny at a proportion of sites within the alignment. An alternative model that 
allows the action on positive selection on foreground branches is compared with a null model that 
doesn't allow co >1 . Twice the Ain of the two models is then compared to a % 2 distribution (degrees 
of freedom=2); if the null model is rejected, at least one site is under positive selection on the 
foreground branch(es). To detect selection at individual sites, twice the difference of the likelihood 
for the alternative and the null model at each site is compared to a % 2 distribution (degree of 
freedom=l). BUSTED is implemented in the HYPE1Y package (36). 

Conservatively, we considered a site as selected if it showed a p value < 0.05 in BUSTED and a 
posterior probability > 0.90 in the BEB analysis. 

The site models implemented in PAML were applied for the analysis of nsp3 sequences from 
MERS-CoVs isolates. To detect selection, two different pairs of nested site models (Mla/M2a and 
M7/M8) were fitted to the data (33); the M2a and M8 allow a class of sites to evolve with co>l, 
whereas Mia and M7 do not. Positively selected sites were identified using the BEB analysis (from 
model M8) (37). Sites were validated using MEME (38) (with a cutoff <0.1), which allows the 
distribution of co to vary from site to site and from branch to branch at a site. 

MEME (38) analyses were performed through the DataMonkey server (39). 


Detection of co-evolving sites 

To detect co-evolving sites in the ORFla and ORFlb alignments we applied two different methods: 
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BGM (Bayesian Graphical Model)-Spidermonkey (40) and the Mutual Information Server To Infer 
Coevolution (MISTIC) (41). Spidermonkey is a tool implemented in the HYPHY package that 
identifies co-evolving sites from an alignment of coding sequences; a BGM is used to evaluate the 
connection among codons (represented by the nodes of the network). Significant statistical 
associations between nodes are indicated by the edges of the network, suggesting functional or 
structural interactions between codons. 

MISTIC estimates the relationship between two or more position in an alignment. The co¬ 
evolutionary association is estimated by Mutual Information (MI), that evaluates how much the 
information from the aminoacid at the first position can help to predict the aminoacid identity at the 
second position. 

For BGM-Spidermonkey sites were filtered based on a minimum count of 4 substitutions across the 
phylogeny. To be conservative, we considered a pair of residues as co-evolving if they showed a 
posterior probability >0.75. This threshold corresponds to 0.02% and 1.42% of all analyzed site 
pairs in ORFla and ORFlb, respectively. Likewise, for MISTIC site pairs were required to display 
a MI rank higher that the 99 th percentile calculated using all MI scores from the alignment. Pairs of 
sites exceeding the thresholds for both methods were declared to be co-evolving. 

Membrane topology, glycosylation site predictions, and 3D structure mapping 

The membrane protein topology for MERS-CoV nsp3 and nsp4 was predicted by using TMF1MM 
(http://www.cbs.dtu.dk/services/TMFlMM/) (42). N-Glycosylation sites were predicted with 
NetNGlyc (http://www.cbs.dtu.dk/services/NetNGlyc/), a program that uses artificial neural 
networks to examine the sequence context of Asn-X-Ser/Thr motifs. 

Sites were mapped onto structures using PyMOL (The PyMOL Molecular Graphics System, 
Version 1.5.0.2 Schrodinger, LLC). 
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Results 

nsp3 in ORFla is a major selection target in betaCoVs 

The lineage C of betaCoVs includes two bat species closely related to MERS-CoV, namely Ty- 
BatCoV HKU4 and Pi-BatCoV HKU5, isolated from the lesser bamboo bats (Tylonycteris 
pacliypus ) and Japanese pipistrelles (Pipistrellus abramus), respectively (4). Additional viruses 
belonging to the lineage C of betaCoVs have been described in bats (BtCoV/133, BtVs- 
BetaCoV/SC2013) and hedgehogs (Hedgehog coronavirus, EriCoV) (21, 43-45). Recently, a virus 
belonging to the same species as MERS-CoV was isolated in Neoromicia bats (NeoCoV) (46). To 
investigate the evolutionary history of ORF1 a, we obtained sequence information for these viruses 
and for 6 MERS-CoV strains isolated from either humans or camels and belonging to the major 
groups described to date (47) (Fig. 1). The sequence alignment was pruned of unreliably aligned 
codons (see Material and Methods), a procedure that resulted in the masking of the almost entire 
acidic domain in nsp3. Indeed, this region was previously shown to be highly divergent among 
CoVs (48). We next analyzed the alignment for the presence of recombination breakpoints using 
GARD (Genetic Algorithm Recombination Detection) (29) and GENECONV (30). No evidence of 
recombination was detected. 

The pylogenetic tree of ORFla obtained with phyML was consistent with previously reported ones 
(46, 47). An estimate of the extent of functional constraint along ORFla was obtained by 
identification of negatively selected sites (total number =903) and calculation of their distribution 
among nsps. This analysis indicated that the average fraction of negatively selected sites is -0.24, 
with weakest selection in nspl and strongest constraint in nsps6-9 (Fig. 1). 

Evidence of episodic positive selection along the internal branches of the ORFla phylogeny was 
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224 searched for using branch-site tests. Specifically, we applied two different methods: the branch-site 

225 unrestricted statistical test for episodic diversification (BUSTED) (35) and the maximum-likelihood 

226 models (MA/MA1) implemented in the PAML suite (33). These two approaches rely on different 

227 assumptions of co (nonsynonymous/synonymous rate ratio) variation among branches. To be 

228 conservative, episodic positive selection at each tested branch was declared only if statistically 

229 significant support was obtained with both methods. Using this criterion, we found 4 branches with 

230 evidence of episodic selection (Fig. 1, Table 1). Selected sites along these branches were identified 

231 using the Bayes empirical Bayes (BEB) procedure from model MA and with BUSTED; again, only 

232 sites detected by both methods were considered. A total of 55 selected sites were detected; these 

233 were scattered along the entire ORF1 a and located in different nsps, with the exclusion of nsp7 and 

234 nsp9, which were not targeted by selection (Fig. 1, Table S2). To determine whether any nsp 

235 represented a preferential target of episodic positive selection, we performed random uniform 

236 sampling (i.e. assuming a random distribution of selected sites, we calculated the likelihood of 

237 identifying a number of sites equal to or higher than the one we observed for each nsp). This 

238 analysis indicated that nsp3 was preferentially targeted by episodic selection (Bonferroni corrected 

239 p value= 0.010) during the evolution of MERS-related CoVs. 

240 Finally, we searched for evidence of co-evolution between sites in the ORFla alignment. To this 

241 aim, we applied BGM-Spidermonkey (40) and MISTIC (41) (see Material and Methods for details). 

242 Six pairs of co-evolving sites were detected by both methods. Most site pairs were located in nsp3 

243 (Fig. 1, Table S3). 

244 

245 Weaker selective pressure for ORFlb 

246 

247 The same approach described above was applied to ORFlb sequences. The degree of constraint in 
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248 nspl2-16 was comparable to that observed for ORFla (Fig. 1A). Statistical support of episodic 

249 positive selection was detected for 2 branches only (Fig. IB, Table 1) and fewer selected sites were 

250 detected compared to ORF la. Most sites were targeted by selection on the branch leading to bat 

251 CoVs (F1KU5, BtCoV/133, HKU4) (Fig. 1, Table S2). No nsp resulted a preferential selection 

252 target. One pair of co-evolving sites was detected by MISTIC and BGM-Spidermonkey (Fig. 1 A, 

253 Table S3). 

254 

255 Ongoing selection at nps3 in MERS-CoV 

256 

257 Given the results above, we investigated whether positive selection also occurred at nsp3 during the 

258 recent evolution of MERS-CoV. To this purpose, we retrieved 87 sequences from MERS-CoVs 

259 isolated from camels or humans (available strains as of July 2015) (Table SI). No recombination 

260 breakpoint was detected by either GARD or GENECONV and the codeml site models were applied 

261 (33). Results showed that models of gene evolution that allow a class of codons to evolve with co >1 

262 (NSsite models M2a and M8) better fit the data than the neutral models (NSsite models Mia and 

263 M7), strongly supporting the action of positive selection (Table 2). Positively selected sites in nsp3 

264 were identified using two methods, the BEB procedure implemented in M8 and MEME (Mixed 

265 Effects Model of Evolution). Two sites (G720 and R911) were detected by both methods (Table 2, 

266 Fig. 2). Four different residues are observed at position 720 in MERS-CoV sequences, and three of 

267 them are present in viruses isolated from camels (Fig. 2), suggesting that adaptive evolution at this 

268 site started before the spread to humans. Conversely, the R911C substitution only occurs in viruses 

269 derived from humans: all of them belong to group 3 and include the virus responsible for the recent 

270 South Korean outbreak (MERS-CoV/KOR/KNIH) (Fig. 2). 

271 
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Positively selected sites in the context of betaCoV biology 


To shed light into the functional role of selected sites we exploited available structural, genetic, and 
biochemical data for nsps (mainly obtained for SARS-CoV and MHV), as well as in silico 
prediction of transmembrane helices and glycosylation sites. 

Most positively selected sites were located in nsp3 (Fig. 3 A). In the PLpro domain, four of the 
positively selected sites, including two that are selected in the viral stains from the current MERS- 
CoV epidemic, were clustered in a spatially confined region opposed to the catalytic crevice. This 
region corresponds to the “palm” of the right-handed architecture of the protease, whereas three 
additional sites are located on the “fingers” (Fig. 3A). One of these three surface exposed sites, 
Q830, is part of a non-canonical nsp5 cleavage site, unique for MERS-CoV, that could contribute to 
the formation of new cleavage products (49) (Fig. 3 A). 

Additional sites were identified in the other domains of the protein (Fig. 3 A). One positively 
selected site in the Ubll domain (C91) precedes a conserved acidic loop that when mutated in 
SARS-CoV determines a lethal phenotype (50). Among sites in the ADPR domain, none was 
located within the ADP-ribose binding pocket (Fig. 3A) (51). As for the nucleic-acid binding 
domain (NAB) (Fig. 3 A), one of the positively selected site (T1080) is part of the a 2 helix, that 
together with the loop preceding the (3 6 strand, forms a structure possibly interacting with ssRNA 
via a charged patch (52). 

Finally, positively selected sites were found in the so-called transmembrane region of nsp3 (Fig. 3). 
We performed an in silico prediction of transmembrane helices and glycosylation sites. As 
previously observed for MF1V and SARS-CoV (53), the topology model did not fit the general 
Nendo/Cendo shared structure and was inconsistent with the location of the glycosylation site. 
Thus, in analogy to other CoVs, some hydrophobic regions predicted to be transmembrane are 
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unlikely to span the lipid bilayer. Intriguingly, we identified two positively selected sites in the 
lumenal loop; one of them (A1386) is located between the first two conserved cysteine residues 
(Fig. 3B) of the zinc finger motif (ZF). Other positively selected sites were found in the third 
hydrophobic region (that may or may not span the membrane). 

In nsp4, four transmembrane regions were predicted (in SARS-CoV and MHV all of these are 
membrane-spanning) (54) (Fig. 3C). One of the positively selected sites is located on the large 
lumenal loop 1 (Fig. 3C) and does not affect (or introduces) a predicted glycosylation site. Variants 
in the same loop of MHV nps4 were shown to alter DVM morphology or number, irrespective of 
their effect on glycosylation (55). Two additional positively selected sites (G453 and A479) are 
located in the C-terminal domain of the protein (Fig. 3C). This domain is cytosolic and interacts 
with viral and host proteins (56). 

We also detected 4 positively selected sites in the other viral protease, 3CLpro (nsp5) (Fig. 4). 
Interestingly, two of them (H8 and V132) were proposed to affect the correct domain orientation 
required for dimerization of MERS-CoV nsp5 (57). Moreover, site-directed mutagenesis of 3CLpro 
from MHV indicated that the two corresponding sites affect protease activity in a temperature- 
dependent fashion (58, 59) (Fig. 4). 

Concerning selected sites in nsplO and ORFlb, interesting findings relate to the npsl0-nspl4 and 
nspl0-nspl6 interactions. Based on the crystal structure of the corresponding SARS-CoV protein 
complexes, one of the positively selected sites in nspl6 (K249) is located at the direct interaction 
surface with nsplO (Fig. 5A). Also, the only positively selected site we detected in nsplO (S61) is 
involved in the interaction with both nspl6 (60) and nspl4 (61) (Fig. 5). In fact, mutation of residue 
61 in SARS-CoV nsplO (A61V mutant) strongly reduces interaction with nspl4. In MHV and 
SARS-CoV, S61 is located on an exposed loop that also includes Q65 (Fig. 5); mutation of Q65 
results in a temperature sensitive (ts) phenotype in MHV and, through a poorly understood 
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320 mechanism, disrupts nsp5 function (62). This observation suggests that this loop may be involved in 

321 interaction with other nsps, or that aminoacid changes in this loop can modify nsplO conformation 

322 (62). 

323 Additional selected sites in npsl4 and 16 do not involve the contact interface with nsplO. In nspl4, 

324 T228 is located within one of the zinc fingers of the protein (Fig. 5); these domains are essential for 

325 the ExoN proofreading activity of nspl4 (63). Another selected residue (S284) flanks a position 

326 (F286 in SARS, highly conserved in coronaviruses) that is involved in the hydrophobic interaction 

327 between the ExoN and N7-methyltransferase domain (63). 

328 As for nspl6, the T151 selected site is in close proximity to a residue (L153) (Fig. 5) that originates 

329 a ts phenotype in MHV; specifically, the mutation has an effect on RNA synthesis (64). Finally, the 

330 positively selected site 138 (Asn in MERS-CoV) is located on a solvent-exposed flexible loop (not 

331 present in the structure) on the RNA binding groove; this loop may be involved in interaction with 

332 RNA (65). 

333 

334 Discussion 

335 

336 In recent years, the emergence of SARS-CoV and MERS-CoV as dangerous zoonoses stirred great 

337 interest in the ecology and evolution of coronaviruses. Both viruses originated in bats and spread to 

338 humans via an intermediate host (3, 66). This clearly highlights the potential for CoV host shifting 

339 and the relevance of understanding the molecular events underling the adaptation to new species. 

340 Since the identification of MERS-CoV, a number of related viruses were isolated from bats (and 

341 other mammals) all over the world, suggesting a wide distribution of betaCoVs. Unfortunately, most 

342 of these studies reported partial viral sequences, hampering a fully resolved phylogenetic analysis 

343 (67-69). Herein we only included viral species/strains with complete information for ORFla and 
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ORFlb, and the sequences we analyzed are relatively divergent. This may introduce a high false 
positive rate in the inference of positive selection due to two major issues: the unreliability of 
sequence alignments and the saturation of substitution rates. To circumvent these possible problems 
we adopted a stringent alignment filtering criterion and we computed substitution rates over 
phylogenies. This latter procedure allows breaking of long branches, resulting in improved rate 
estimation. In fact, branch-site tests are relatively insensitive to the saturation issue (70). Moreover, 
we supported all claims of positive selection by the combined use of two methods. Although this 
may have resulted in a loss of power (that is intrinsically low for branch site tests (32)), we were 
able to detect several positively selected sites along different branches of the phylogenies. In both 
ORFla and ORFlb episodic selection was particularly strong on the branch leading to bat CoVs 
(F1KU4, F1KU5, and BtCoV133), possibly reflecting specific characteristics of these host species 
(large population sizes, high seroprevalence, and wide geographic distributions). 

Recently, a study of the spike (S) protein in MERS-CoV-related viruses analyzed a very similar set 
of viral sequences as the ones herein and detected positive selection at 9 sites (25). The proportion 
of selected sites we identified in the ORFla region is much higher than that for the S gene. Albeit 
counter-intuitive, as the S protein is exposed on the viral surface and functions as a central 
determinant of host range and of antibody response, this observation indicates that ORFla 
represented a major target of positive selection during the evolution of lineage C betaCoVs. 
Therefore, adaptive evolution in this region might contribute to host shifts or immune evasion. In 
line with these results, studies on avian influenza A viruses showed that viral surface proteins are 
not the sole determinants of host range. Conversely, adaptation of avian influenza to mammalian 
hosts was found to be critically dependent on changes in the polymerase genes (23, 24). Recently, 
an extended analysis indicated that the adaptation of avian influenza virus to swine was 
accompanied by substitutions in almost all viral genes. Mutation accrual continued for a long time 
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368 after the host shift, suggesting that multiple mutations progressively optimize viral fitness in the 

369 new host (71); many of these mutations were suggested to represent compensatory or epistatic 

370 changes (71). Interestingly, we also found evidence of co-evolution between site pairs, possibly 

371 suggesting epistatic interactions. The vast majority of co-evolving sites were located in nsp3, in line 

372 with the view that epistatically interacting substitutions are enriched in protein regions undergoing 

373 adaptive evolution (72). Epistasis is very common in viruses and is thought to play an important 

374 role in the evolution of immune evasion, host shifts, and drug resistance (73, 74); this latter is likely 

375 not at play in the case of MERS-CoV and related viruses. Previous studies on RNA viruses have 

376 mainly focused on the effect of epistatic interaction on the emergence of drug resistance (e.g. 

377 influenza virus hemagglutinin and neuraminidase, E1CV NS3 protease, HIV-1 protease and reverse 

378 transcriptase) or antibody escape (influenza virus hemagglutinin and neuraminidase) (75-78). Little 

379 information is available on the possible role of co-evoloving sites in the context of innate immunity 

380 or adaptation to new hosts. Nonetheless, epistatic interaction between two sites in the Chikungunya 

381 virus El envelope glycoprotein underlies the ability of the pathogen to infect a new mosquito vector 

382 (79). These observations suggest that intra- and inter-gene epistasis contributes to determine the 

383 evolutionary trajectories of viral species whenever the environment (broadly defined, including the 

384 host species) changes (74). The co-evolving sites we identified are located in distinct domains of 

385 nsp3 and the pair in nspl2 and nspl4 does not involve residues at the interaction surface between 

386 the two proteins. Thus, we are presently unable to infer the molecular mechanism underlying their 

387 interaction. The generation of mutant viruses or recombinant viral proteins carrying different amino 

388 acids at co-evolving site pairs will be necessary to address the nature of their interaction and the 

389 effect on the viral phenotype. We note, however, that long-range interaction are likely to be 

390 common in ORFla and ORFlb, both at the intra- and inter-protein level, as demonstrated by 

391 experiments with SARS-CoV and MHV mutants (58, 62). For instance, a temperature sensitive (ts) 
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392 mutation in MHV nsplO affects viral replication by blocking the activity of nsp5 (62). Also, ts 

393 mutations in MHV nsp5 can be rescued by second-site suppressor changes located in physically 

394 distant protein regions (58). 

395 The CoV RTC is extremely complex and comprises a number of enzymatic activities that constitute 

396 a unique repertoire among RNA viruses. Several molecular details on the assembly and functioning 

397 of the RTC are still missing and the mechanisms underlying the phenotypic effects of specific 

398 changes are often poorly understood. For instance, Stokes and coworkers described a conservative 

399 mutation (M575I) located in the N-terminal portion of the ADPR domain of MHV nps3 that 

400 determines a ts phenotype with impaired RNA synthesis (80). How the mutation exerts its effects, 

401 though, remains unknown. We thus interpreted the effect of positively selected sites in light of what 

402 is known of CoV biology, and we compared the location of selected sites to the few substitutions (in 

403 either SARS-CoV or MHV) with a known effect, as detailed above. Clearly, the selected sites we 

404 identified herein lend themselves to experimental testing to assess their impact on BetaCoV 

405 phenotypes. Indeed, evolutionary analyses can provide information on the location and nature of 

406 adaptive changes, thus highlighting the presence of functional genetic variants. For instance, the 

407 major selection target, nsp3, has multiple enzymatic functions. It would be extremely interesting to 

408 use site-directed mutagenesis and biochemical analyses to assess whether the selected sites in the 

409 PLpro domain affect the enzyme's specificity not only towards the viral polyprotein, but also in 

410 terms of deubiquitination and delSGylation activities, which, in turn, may modulate the viral ability 

411 to evade host immune responses (17). Likewise, expression of mutant nsp3 and nsp4 proteins 

412 carrying different aminoacids at the positively selected sites in the large luminal loops will be 

413 instrumental to determine whether, as shown for other CoVs (54, 55), these changes affect the 

414 formation of membrane rearrangements onto which the RTC assembles. Ultimately, the generation 

415 of mutant CoVs followed by in vitro infection will clarify the effect of specific changes on viral 
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416 fitness, at least in cell cultures. 

417 It is noteworthy that we found nsp3 to represent a preferential target of positive selection and that 

418 the adaptive process is ongoing in circulating MERS-CoV strains. A previous genome-wide study of 

419 positive selection in MERS-CoV detected only one positively selected site in the S gene (2); 

420 however, fewer strains than those we analyzed herein were available at the time of the study and the 

421 authors did not include sequences obtained from camels (2). The two selected sites we identified are 

422 located in the PLpro domain of nsp3 and, together with two sites selected in bats, map to the “palm” 

423 portion of the right-handed structure of the protease. The palm also accommodates the catalytic 

424 triad, but the selected sites are located at the opposite side of the crevice, suggesting that they may 

425 exert an effect by altering the conformational structure of the protease or via interaction with other 

426 viral components or host proteins. In addition to its role in viral replication PLpro functions as a 

427 multitasking inhibitor of IFN responses and physically associates with several host innate immunity 

428 molecules (20). Because antagonism of the host immune system is considered a major driver of 

429 evolutionary change in viruses (81), the G720K variant is an excellent candidate as a modulator of 

430 host responses. Elowever, variability at position 720 is also observed among viruses isolated from 

431 camels, suggesting that the selective event ensued in these animals or in a previous host and does 

432 not represent a specific adaptation to humans. A similar observation was previously reported for the 

433 positively selected site in the S protein (25). Conversely, variation at the second selected site (R911) 

434 in PLpro was only observed for viruses isolated from human patients. Although this finding may 

435 simply reflect the sparse sampling of camel-derived viruses, adaptation of a zoonotic virus to a new 

436 host is expected to result in selective events that optimize viral fitness in terms of replication 

437 efficiency, transmissibility, and immune evasion (73). 

438 In general, deeper understanding of the adaptive events that underlie host shifts in CoVs and other 

439 viruses will be pivotal to predict and prevent future zoonoses. Bats alone host a variety of CoVs that 
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440 represent potential threats to human health (67-69). Data herein suggest that monitoring programs 

441 and field surveys of CoV diversity and prevalence should envisage molecular characterization of 

442 nsp3. 
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Figure Legends 


Figure 1. (A) Schematic representation of ORFla/ORFlb and their nsps products, nsps are colored 
in hues of blue depending on the percentage of negatively selected sites, nspll is shown in gray 
because it was not analyzed (because it is too short). Positively selected sites are represented by 
triangles, with color corresponding to the selected branch in the phylogeny (see panel B). Co¬ 
evolving sites are shown below the nsp structure, with different symbols indicating each pair of co¬ 
evolving sites. (B) Maximum likelihood phylogenies for ORFla (left) and ORFlb (right) in lineage 
C betaCoVs. Branches set as foreground lineages in independent branch-site tests are highlighted 
with different colors and numbered. Thick branches yielded statistically significant evidence of 
positive selection. Branch length is proportional to synonymous substitution rate (dS). Co-evolving 
sites are also reported, with different symbols as indicated in panel A. Positions are relative to each 
nsp, see also Table S3. 

Figure 2. Maximum likelihood phylogeny for nsp3 sequences in a subset of isolates representing 
MERS-CoV major groups. The amino acid alignment of the region surrounding the two positively 
selected sites (magenta) in MERS-CoV isolates is also shown. Asterisks denote viruses isolated 
from dromedary camels. 

Figure 3. (A) Representation of nsp3 domain architecture. Positively selected sites are indicated by 
triangles, co-evolving sites with symbols (see Figure 1 legend). In the enlargements, positively 
selected sites were mapped onto known domain 3D-structures of MERS-CoV or SARS-CoV (PDB 
IDs: 2GRI, 3EWR, 4RNA, and 2K87). The acidic domain is shown in gray because it was not 
analyzed (see text). Topology maps and probability diagrams of transmembrane helices for MERS- 
CoV nsp3 transmembrane domain (B) and MERS-CoV nsp4 (C). The conserved cysteine residues 
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486 and the predicted N-glycosylation sites are mapped onto the lumenal loops. Color codes are as in 

487 Figure 1; yellow denotes protein regions or sites known to be functional and mentioned in the text. 

488 

489 Figure 4. (A) Structure of the dimeric form of MERS-CoV nsp5 (PDB ID: 4YLU). Positively 

490 selected sites affecting the dimerization process are labeled onto the structure. Color codes are as in 

491 Figure 1. Catalytic residues are in yellow. (B) Amino acid alignment of the nsp5 regions 

492 surrounding selected sites. Residues that confer a temperature sensitive phenotype when mutated in 

493 MF1V are underlined (see text). 

494 

495 Figure 5. Ribbon representation of SARS-CoV nspl0-nspl6 complex (PDB ID: 2XYQ) (A), and 

496 SARS-CoV nspl0-nspl4 complex (PDB ID: 5C8U ) (B). Positively selected sites in lineage C 

497 betaCoVs are shown in green (see Figure 1), residue involved in inter- or intra-protein interactions 

498 are shown in yellow. An amino acid alignment of the nsplO exposed loop containing S61 and Q65 

499 is also shown. Functional residues in MHV/SARS-CoV are underlined (see text). 
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Table 1. Likelihood ratio test statistics for branch-site tests. 


Region 

Foreground 

branch 3 

-2AlnL b 
(MA vs 
MA1) C 

MA vs MAI p value 
(FDR corrected p 
value) 

-2AlnL b 

(BUSTED) 

BUSTED 
p value 

N. of sites 
identified by BEB 
and BUSTED 

ORFla 


Branch 18 

4.35 

0.0369 (0.0369) 

0.50 

0.778 

- 


Branch 17 

49.18 

2.34xl0‘ 12 (3.90xl0‘ 12 ) 

23.18 

9.25xl0‘ 6 

1 (nsp3) 


Branch 16 

75.22 

4.21xl0‘ ls (1.05xl0' 17 ) 

85.9 

2.58xl0' 19 

13 (1 nspl, 1 nsp2, 

8 nsp3, 2 nsp4, 1 
nsp8) 


Branch 23 

93.39 

4.29xl0‘ 22 (2.14xl0' 21 ) 

71.76 

2.61xl0‘ 16 

41 (6 nsp2, 25 nsp3, 
4 nsp4, 4 nsp5, 1 
nsp6, 1 nsplO) 


Branch 25 

26.84 

2.20x1 O' 7 (2.75xl0‘ 7 ) 

7.52 

0.023 

- 

ORFlb 


Branch 18 

0.78 

0.38 (0.475) 

- 

- 

- 


Branch 17 

8.53 

0.0035 (0.0058) 

2.8 

0.246 

- 


Branch 16 

12.70 

3.6x1 O' 4 (0.0009) 

17.36 

1.70x1 O' 4 

1(nspl2) 


Branch 23 

37.83 

7.73xlO' 10 (3.86xl0' 9 ) 

31.94 

1.16xl0' 7 

11 (2 nspl2, 1 
nspl3, 3 nspl4, 1 
nspl5, 4 nspl6) 


Branch 25 

0.23 

0.63 (0.63) 

- 

- 

- 


a Branches are numbered as in figure 1. 

b 2AlnL is twice the difference of the natural logs of the maximum likelihood of the models being compared. 
c MA and MAI are branch-site models that assume four classes of sites: the MA model allows a proportion of codons to 
have co> 1 on the foreground branches, whereas the MAI model does not. 
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Table 2. Likelihood ratio test statistics for models of variable selective pressure among sites. 


region 

LRT model 

-2AlnL c 

p value 

% of sites 
(average to) 

Positively selected 
sites 

(BEB and MEME) 

nsp3 


Mia vs M2a“ 

M7 vs M8 b 

15.25 

14.15 

0.00048 

0.00085 

0.6(17.24) 

1.6(11.59) 

G720, R911 


a Mia is a nearly neutral model that assumes one co class between 0 and 1, and one class with co=l; M2a (positive 
selection model) is the same as Mia plus an extra class of co >1. 

b M7 is a null model that assumes that 0<co<l is beta distributed among sites; M8 (positive selection model) is the same 
as M7 but also includes an extra category of sites with co> 1. 

c 2AlnL: twice the difference of the natural logs of the maximum likelihood of the models being compared. 




